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Abstract 

In this paper we consider the Nonnegative Matrix Factorization (NMF) 
problem: given an (elementwise) nonnegative matrix V £ R™^" find, for 
assigned k, nonnegative matrices W € R!^^'° and H G M.*^" such that 
V = WH. Exact, non trivial, nonnegative factorizations do not always 
exist, hence it is interesting to pose the approximate NMF problem. The 
criterion which is commonly employed is I-divergence between nonnega- 
tive matrices. The problem becomes that of finding, for assigned k, the 
factorization WH closest to V in I-divergence. An iterative algorithm, 
EM like, for the construction of the best pair {W, H) has been proposed 
in the literature. In this paper we interpret the algorithm as an alternat- 
ing minimization procedure a la Csiszar-Tusnady and investigate some of 
its stability properties. NMF is widespreading as a data analysis method 
in applications for which the positivity constraint is relevant. There are 
other data analysis methods which impose some form of nonnegativity: 
we discuss here the connections between NMF and Archetypal Analysis. 
An interesting system theoretic application of NMF is to the problem of 
approximate realization of Hidden Markov Models. 
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1 Introduction 



The approximate Nonnegative Matrix Factorization (NMF) of nonnegative ma- 
trices is a data analysis technique only recently introduced |B1 110| . Roughly 
speaking the problem is to find, for a given nonnegative matrix V G R™^", and 
an assigned fc, a pair of nonnegative matrices W G M™^'"' and H e M'^" such 
that, in an appropriate sense, V « WH. EM like algorithms for the construc- 
tion of a factorization have been proposed in (5117]. In ^U] the connection of 
these algorithms with the classic alternating minimization of the I-divergence ^ 
has been pointed out but not fully investigated. In this paper we pose the NMF 
problem as a minimum I-divergence problem that can be solved by alternating 
minimization and derive, from this point of view, the algorithm proposed in [S]. 

Although only recently introduced the NMF has found many applications 
as a data reduction procedure and has been advocated as an alternative to 
Principal Components Analysis (PCA) in cases where the positivity constraint 
is relevant (typically image analysis). The title of ^HI is a clear indication of 
this point of view, but a complete analysis of the relations between NMF and 
PCA is still lacking. Other data analysis methods proposed in the literature 
enforce some form of positivity constraint and it is useful to investigate the 
connection between NMF and these methods. An interesting example is the so 
called Archetypal Analysis (A A) technique [2]. Assigned a matrix X G K™^" 
and an integer fc, the AA problem is to find, in the convex hull of the columns 
of X, a set of k vectors whose convex combinations can optimally represent X. 
To understand the relation between NMF and AA we choose the L2 criterion 
for both problems. For any matrix A and positive definite matrix S define 
Plls = (tr(ATEA))i/2. Denote = ||A||. The solution of the NMF 

problem is then 

(W,H) = argmin||T/-M/^ff|| 

W,H 

where the minimization is constrained to the proper set of matrices. The solu- 
tion to the AA problem is given by the pair of column stochastic matrices {A, B) 
of respective sizes k x n and m x k such that ||X — X_BA|| is minimized (the 
constraint to column stochastic matrices is imposed by the convexity). Since 
||X - XBA\ \ = \\I - BA\\xTx the solution of the AA problem is 

(A,B) = argmin||/-BA||xTx. 

AA and NMF can therefore be viewed as special cases of a more general problem 
which can be stated as follows. Given any matrix P e K™^", any positive 
definite matrix E, and any integer fc, find the best nonnegative factorization 
P ~ Q1Q2 (with Qi e M!^'''^ Q2 e R+''") in the L2 sense, i.e. 

(Qi,Q2) = arg min ||P-QiQ2||s. 

Qi 1Q2 

Our interest in NMF stems from the system theoretic problem of approximate 
realization (or order reduction) of Hidden Markov Models. Partial results have 
already been obtained 



2 



2 Preliminaries and problem statement 



The NMF is a long standing problem in linear algebra [51 IH). It can be stated 
as follows. Given V G R™^", and 1 < fc < min(m, n), find a pair of matrices 
W e and iJ G M^""" such that V = WH. The smallest k for which a 

factorization exists is called the positive rank of V, denoted prank(F). This 
definition implies that rank(V^) < prank(l^) < min(m,n). It is well known that 
prank(\^) can assume all intermediate values, depending on V. Examples for 
which nonnegative factorizations do not exist, and examples for which factor- 
ization is possible only for k > rank(y) are easily constructed The prank 
has been characterized only for special classes of matrices |H] and algorithms for 
the construction of a NMF are not known. The approximate NMF has been 
recently introduced in |BJ independently from the exact NMF problem. The set- 
up is the same, but instead of exact factorization it is required that V ~ WH in 
an appropriate sense. In |H| and in this paper the approximation is to be under- 
stood in the sense of minimum I-divergence. For two nonnegative matrices (or 
vectors) M ~ (Mij) and N — {Nij) of the same size the I-divergence is defined 
as 

D{M\\N) = J2{M^, log ^ - A% + iV,,), 

with the conventions 0/0 = 0, OlogO — and p/0 = oo for p > 0. ^From the 
inequality x log a; > a; — 1 it follows that D{M\\N) > with equality iff M ~ N. 
The problem of approximate NMF is to find 

aigmm D(V\\WH). 

W,H 

It can be shown that, if Vij > 0, the minimum is attained. Dropping constants 
the problem is equivalent to finding 



mn^F{W,H) ^(F., log(I¥i/),, - (WH),,). 



Clearly the solution is not unique. In order to rule out too many trivial multiple 
solutions, we impose the condition that H is row stochastic, so J^j ^ij — 1 for 
all /. This is not a restriction. Indeed, excluding without loss of generality the 
case where H has one or more zero rows, let h be the diagonal matrix with 
elements /i, = t^ien WH = WH with W = Wh, H = h-^H and 

H is by construction row stochastic. The convention that H is row stochastic 
still doesn't rule out non-uniqueness. Think e.g. of post-multiplying W with a 
permutation matrix 11 and pre- multiplying H with n~^. 

Although the function F is concave in each of its arguments W and H 
separately, it does not have this property as a function of two variables. Hence 
F may have several (local) maxima, that may prevent numerical algorithms for 
a global maximum search to converge to the global maximizer. 
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Let e (e^) be a column (row) vector of appropriate dimension whose elements 
are all equal to one. The (constrained) problem we will look at is then 

max FiW,H). (1) 

W,H:He=e 

Notice that the constrained problem ^ can be rewritten as 

To carry out the maximization numerically OEI propose an iterative algorithm. 
Denoting by and iJ" the matrices at step n, the update equations are the 
following 

Ti/n rrn 

- l^^^i (^W^H'-),/ {W^W%- 

There is no rationale for this algorithm although the update steps ^ and ^ 
are like those in the EM algorithm, known from statistics, see Likewise the 
convergence properties of the algorithm arc unclear. In the next section we will 
cast the maximization problem in a different way that provides more insight in 
the specific form of the update equations. 



3 Lifted version of the problem 

In this section we lift the I-divergence minimization problem to an equivalent 
minimization problem where the 'matrices' (we should speak of tensors) have 
three indices. Because we insist on probabilistic interpretations we change no- 
tations as follows. P S R'J*^" is a given, fixed matrix and 

P = {P e : E/ = ^(U)}, 

Q = {Q e MV""'="" : Q(*;j) = Q-{il)Q+m, Q-W, Q+{lj) > 0, Q+e = e}, 
Q = {Qe Mlf'^" : Qiij) = E/ Q{ilj) for some Q e Q}. 

Notice that Q is the class of to x n matrices that admit exact NMF of size 
k. In the notation of section 2, V has become P, and W,H are now Q-,Q+ 
respectively. 

The following observation (whose proof is elementary, see [S]) motivates our 
approach. 
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Lemma 3.1 P can be factorized as P = Q-Q+ iff V f\ Q ^ $, so iff there 
exists a P £ P and Q G Q such that P = Q. 

For a probabilistic interpretation of this lemma, and of the results below, we 
assume (without loss of generality) that P represents the joint distribution of 
a three dimensional random vector. Suppose that Y_ and y+ are finite val- 
ued random variables defined on a probability space T, P) whose joint dis- 
tribution is given by P(Y_ — i,Yj^ = j) = P{ij). Then the content of the 
lemma is that there exists a finite valued random variable X such that and 
1+ are conditionally independent given X iff P = The matrix Q_ 

then gives the joint distribution of Y- and X by Q-{il) = P(F_ = i^X = I), 
whereas the matrix can be interpreted as conditional distributions of Y~^ 
given X via Q+{lj) = P(l+ = j\X = I). Moreover, in this case we have 
P(F_ = i,X = Z,y+ = j) = Q{ilj). To see this we write the conditional 
independence relation 

p(y_ =i,Y+ = j\x = i) = p(y_ = i\x = i)¥{Y+ = j\x = I) 

in equivalent form as 

p(y_ =i,x = i,Y+ = j) = p(y_ ^i,x^ /)p(y+ = j |x = o, 

from which the above statements immediately follow. 



4 Two partial minimization problems 

In this section we consider the following two minimization problems. In the first 
one we minimize for given Q G Q the I-divergcnce £'(P||Q) over P G P. In the 
second problem we minimize for given P G 7-* the I-divergence D(P||Q) over 
Q G Q. The unique solution P* = P*(Q) to the first problem can be computed 
analytically and is given by 

where Q{ij) — Yli QiHj)- A direct computation gives the useful relation 
D{P*{Cim)=D{P\\Q). 



The interpretation in terms of random variables is that for a given probability 
measure Q, random variables y_,X, y+ with law Q{Y- = i,X = l,Y+ = j) = 
Q{ilj), the best approximating model P* with marginal distribution of y = 

(y_,y+) described by P is given by 

P*{ilj) = P(y- =i,X = I, y+ = j) 

= q{X = l\Y_=t,Y+=j)P{i,j). 
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Equivalently, we can say that P* is such that the marginal distribution of Y 
under F* is given by P and the conditional distribution of X given Y under P* 
is equal to the conditional distribution under Q. Below we will see that this is 
not a coincidence. 

The solution Q* = Q* (P) to the second problem is given by 

Q*_{il) = J2^{ilj) (5) 

i 

The interpretation in probabilistic terms is that for a given distribution P of 
, X, y+ ) , the best model Q* that makes Y- and 1+ conditionally independent 
given X is such that 

Q*(r_ =i,X = 1) = P{Y_ =i,X = l) 

and 

Q*{Y+ = j\X = l,Y_=i) = Q*(y+ = 3\X = l)= P(y+ = j\X = I). 

We see that the optimal solution Q* is such that the marginal distributions of 
(X, y_ ) under P and Q* coincide and that the same happens for the conditional 
distributions of 1+ given X . Again, this is not a coincidence, as we will explain 
below. First we will state for the two partial minimization problems above the 
following two Pythagorean rules. 

Lemma 4.1 For fixed P and Q* = Q*(P) it holds that for any Q G Q 

D{P\\Q)=D{nQ*) + D{Q*\\Q), (7) 

whereas for fixed Q and P* = P*(Q) it holds that for any P G 7-* 

I)(P||Q) = I)(P||P*)+I)(P*||Q), (8) 



D(P*\\Q)^D{P\\Q), (9) 
where Q is given by Q{ij) = Q(*^j)- 

Proof. To prove the first relation we first introduce some notation. Let 
P{il-) = E,P(*0-), Pi-lj) - and Pij\l) = P(-Zj-)/EjP(-0-)- For 

Q we use similar notation and so wo have Q(zZ-) = Q^{il) and Q(j|/) = 
Q+ilj)/Ej Q+iW and Q*_{il) = P{il-) and Ql{lj) = PU\l). Consider 

ilj 
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On the other hand we have 

^(QIIQ) = E P(*^-)P(J IO(iog ^ + log 

The first assertion follows. The second Pythagorean rule follows from 
D(P||P') + D(P'||Q) 

= ^(P|IQ)- 

□ 

For a probabilistic interpretation of the P* and Q* above as well as the Pytha- 
gorean rules we use a general result on the I-divergence between two joint laws 
of a random vector {U, V). We denote the law of this vector under probability 
measures P and Q by P^'^ and Q^'^ . The conditional distributions of U given 

V are summarized by the matrices P^l^ and Q^'^, with the obvious convention 
P^\^{ij) = T{U = i\V = j) and likewise for Q^\^ . 

Lemma 4.2 It holds that 

Z?(P^'^||g^'^) =EpZ?(P^I^||Q^I^) + i5(P^||Q^), (10) 

where 

Proof. This follows from elementary manipulations. □ 

The above relation can be refined as follows. Suppose that V is bivariate, 

V — (Vi, V2) say and that U and V2 are conditionally independent given Vi under 
Q, so the conditional distribution of U given V is the same as the conditional 
distribution of U given Vi under Q. Then the first term on the right hand side 
of equation can be decomposed as 

^rD{P^\^\\Q^\^) ^^rD{P^\^\\P^^^^) +^vD{P^^^^\\Q^\^^). (11) 

We apply this lemma to the first partial minimization problem above by an 
appropriate choice of U and V. Since 7:>(P*||Q) = D{P\\Q) = D(P^\\Q^), 
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where is given by P, we see that tor U = X, V = Y = (Y^,Y+) the 
decomposition ((SJ can alternatively be written as EpD{P^\^ \\Q^\^)+D{P\\Q). 
Minimizing Z)(P| |Q) w.r.t. P under the condition that the marginal of P is given 
by P is thus equivalent to minimizing the I-divergence between the conditional 
distributions P^l^ and Q"^'^, and this clearly happens for p-^^^ = Q^^^ . 
The interpretation of is less straightforward. However, refining |(7J), we have 
parallel to ((TT|l 

i:>(P||Q) =Epi:>(F^+'^'^-||F^+'^) 

Hence the minimization problem here is to minimize the I-divergence between 
the distributions of under P and Q and the I-divergence between the 

conditional probability measures P^+\^ and Q^+l^. This explains the form of 
the optimal solution Q*(P). 

The next proposition shows that the original minimization of D{P\\Q) over 
nonnegative matrices Q for a given nonnegative matrix P is equivalent to a 
double minimization over the sets V and Q. 

Proposition 4.3 Let P be given. It holds that 

mmD(P\\Q)^ min i:>(P||Q). 
Qes PeV.QeQ 

Proof. With P* — P*{Q), the optimal solution of the partial minimization 
over 7^, we have 

i^(P||Q) >i^(P*||Q) 

^D{P\\Q) 

> min D(P\\Q). 

~ QeQ 

It follows that miupgp ^^qD{P\\Q) > miuQ^Q D{P\\Q). 
Conversely, let Q* G Qhe the minimizer of D{P\\Q) and let Q be a correspond- 
ing element in Q. Furthermore, let P G T' be arbitrary. Then we have 

D{P\\Q*)>D{P*{Q)\\Q) 

> min £'(P||Q), 
PeV.QeQ 

which shows the other inequality. □ 



5 Alternating minimization algorithm 

The results of the previous section are aimed at setting up an alternating mini- 
mization algorithm for obtaining ming D{P\ \Q), where P is a given nonnegative 
matrix. In view of proposition ^3] we can lift this problem to the (7^, Q) space. 
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Starting with an arbitrary Qi G Q with strictly positive elements, we adopt the 
following recursive scheme 

Qri ^ P?7 ^ Q?i+1 ^ Pri+1, (12) 

where P„ = P*(Q„), Q„+i = Q*(P„) and P„+i = P*(Q„+i). The two 
Pythagorean rules from lemma Wa\ now take the forms 

n 1 1 

) + D{Pn+l\\Q n-\-l J 
DiPnWQn) = D{Pn\\Qn+l) + i?(Q„+l | |Qn). 

Addition of these two equations results in 

n 1 1 Pn+1 

) + £'(P„+i||Q„+i) + D(Q„_|_i||Q„), 

and together with this becomes 

Z?(F||g„) = i?(P„||P„+i) + i?(P||Q„+i) + i?(Q„+i||Q„). (13) 

This equation also shows that D{P\\Qn) > D(P||Q„+i). The procedure out- 
lined in equation (fT^ will be made expHcit, using equations © and (0). 
Since it is our aim to apply the above results to the problem as sketched in sec- 
tion |21 we now turn back to the notation of that section. So, instead of Q- we 
write W, instead of Q+ we write H, instead of Q we write WH, of course these 
will be endowed with superscript indices n and n + I below, and P becomes 
V again. From (|12() we get Qn+i = Q*(P*(Qn)) and combining this with the 
substitution of Q into (jSJ we obtain in the original notation- 

Y (W^"^")y ' 
which is just (0). Of course (PJ can be derived similarly. 

6 Discussion of the algorithm 

In the previous section we have shown that the update rules Q and Q are the 
result of an alternating minimization procedure. The convergence properties of 
the algorithm can be studied using the general results of 1,. Due to the sim- 
ilarity with the EM algorithm one may expect similar convergence properties, 
see [11]. 

At each iteration the I-divergence between V and the W^H" is reduced, equiv- 
alently the sequence F{W", i/") is increasing. This follows from equation 
Secondly, once the algorithm reaches a stationary {W, 7f)-point of F (the par- 
tial derivatives vanish here), the updated values are exactly equal to the given 
values. This can be immediately seen by computing the fist order necessary 
conditions for a stationary point and comparing these to the update formulas. 
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Moreover, as long as the algorithm does not reach a stationary point there will 
always be a strict increase in the objective function F. In the third place, all 
the W"" and -ff" evolve in a compact set. For the iJ" this is trivial, since they 
are nonnegative row stochastic matrices. For the this follows from 
since W-""*"^ < Vij (starting the algorithm with matrices that have strictly 
positive elements ensures that all VF" and have strictly positive elements). 
A detailed account of the properties of the algorithm is deferred to another 
publication. 
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